Cloudflow - enabling faster biomedical pipelines with MapReduce and Spark

Authors

Lukas Forer

Enis Afgan

Hansi Weißensteiner

Davor Davidovic

Günther Specht

Florian Kronenberg

Sebastian Schönherr

Abstract

For many years, Apache Hadoop has been used as a synonym for processing data in the MapReduce fashion. However, due to the complexity of developing MapReduce applications, adoption of this paradigm in genetics has been limited. To alleviate some of these issues, we previously developed Cloudflow, a high-level pipeline framework that allows users to create sophisticated biomedical pipelines from predefined code blocks, while the framework automatically translates them into the MapReduce execution model. With the introduction of the YARN resource management layer, new computational processing models such as Apache Spark are now pluggable into the Hadoop ecosystem. In this paper, we describe the extension of Cloudflow to support Apache Spark without any adaptations to already implemented pipelines. The performance evaluation demonstrates that Spark can bring an additional boost to the analysis of next-generation sequencing (NGS) data in the field of genetics. The Cloudflow framework is open source and freely available at https://github.com/genepi/cloudflow.

Similar resources

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

MapReduce and Spark are two very popular open source cluster computing frameworks for large scale data analytics. These frameworks hide the complexity of task parallelism and fault-tolerance, by exposing a simple programming API to users. In this paper, we evaluate the major architectural components in MapReduce and Spark frameworks including: shuffle, execution model, and caching, by using a s...

CloudFlow: A data-aware programming model for cloud workflow applications on modern HPC systems

Traditional High-Performance Computing (HPC) based big-data applications are usually constrained by having to move large amount of data to compute facilities for real-time processing purpose. Modern HPC systems, represented by High-Throughput Computing (HTC) and Many-Task Computing (MTC) platforms, on the other hand, intend to achieve the long-held dream of moving compute to data instead. This k...

MPIgnite: An MPI-Like Language and Prototype Implementation for Apache Spark

Scale-out parallel processing based on MPI is a 25-year-old standard with at least another decade of preceding history of enabling technologies in the High Performance Computing community. Newer frameworks such as MapReduce, Hadoop, and Spark represent industrial scalable computing solutions that have received broad adoption because of their comparative simplicity of use, applicability to relev...

Cluster Computing Paradigms– A Comparative study of Evolving Frameworks

Cluster computing is an approach for storing and processing the huge amounts of data being generated. Hadoop and Spark are the two most prominent cluster computing platforms today. Hadoop incorporates the MapReduce concept and is scalable as well as fault-tolerant. However, the limitations of Hadoop paved the way for another cluster computing framework named Spark. It is faster and can also mana...

Large Scale Distributed Data Science from scratch using Apache Spark 2.0

Apache Spark is an open-source cluster computing framework. It has emerged as the next generation big data processing engine, overtaking Hadoop MapReduce which helped ignite the big data revolution. Spark maintains MapReduce’s linear scalability and fault tolerance, but extends it in a few important ways: it is much faster (100 times faster for certain applications), much easier to program in d...

Journal:

Scalable Computing: Practice and Experience

Volume 17

Publication date: 2016
